INTRODUCTION


This  project  is  aimed  at  developing  a  novel  genome-wide  method  for  identifying  the 
direct  chromosomal  targets  of  transcription  factors  that  are  important  in  breast  cancer. 
Cancer  involves,  at  least  in  part,  aberrant  programs  of  gene  expression  often  mediated 
by  oncogenic  transcription  factors  activating  downstream  target  genes.  Distinguishing 
between  direct  and  indirect  targets  of  transcription  factors  is  important  for  reconstructing 
the  transcriptional  regulatory  networks  that  underlie  complex  gene  expression  programs 
that  are  activated  in  cancer.  Indeed,  transcription  factors  have  been  proposed  as  targets 
of  anti-cancer  therapy  [1].  Identification  of  the  target  genes  of  oncogenic  transcription 
factors  is  therefore  of  great  interest  and  an  area  of  intensive  investigation. 

The  chromosomal  targets  of  transcription  factors  can  be  identified  by  the  technique  of 
ChIP-chip,  where  DNA  occupied  by  a  transcription  factor  is  first  isolated  after 
crosslinking  and  immunoprecipitation,  then  identified  by  hybridization  to  a  microarray  of 
regulatory  elements.  One  limitation  with  the  ChIP-chip  approach  is  the  lack  of 
availability  of  truly  comprehensive  promoter  microarrays  covering  the  entire  human 
genome.  In  the  Statement  of  Work,  we  proposed  to  develop  a  new  genomic  method 
named  STAGE  (Sequence  Tag  Analysis  of  Genomic  Enrichment)  that  can  potentially 
overcome  some  of  the  limitations  of  ChIP-chip  analysis  and  can  be  applied  to 
transcription  factors  important  in  breast  cancer  such  as  c-myc  and  ER  (estrogen 
receptor).  STAGE  is  based  on  high-throughput  sequencing  of  concatamerized  tags 
derived  from  DNA  associated  with  transcription  factors  that  is  isolated  by  chromatin 
immunoprecipitation.  The  application  of  STAGE  to  oncogenic  transcription  factors 
involved  in  breast  cancer  can  help  elucidate  their  role  in  gene  regulatory  networks 
underlying  the  disease. 


BODY 

The overall objective of this Idea Award was to develop a novel method for identifying
the direct transcriptional targets of oncogenic transcription factors that are important in
breast cancer. We conceived of STAGE (Sequence Tag Analysis of Genomic
Enrichment) as a novel genome-wide method for identifying the chromosomal targets of
DNA binding proteins in any sequenced genome. STAGE is conceptually derived from
SAGE (Serial Analysis of Gene Expression), but is applied to DNA isolated after
chromatin immunoprecipitation (ChIP). The steps of the procedure in brief are as follows
(depicted in Figure 1 of the attached publication, Kim et al.): (1) DNA-associated proteins
are crosslinked to their target sites in vivo by treating cells with formaldehyde.
Crosslinked chromatin is extracted, sheared by sonication, and immunoprecipitated with
a specific antibody against a given protein of interest (the standard ChIP procedure). (2)
The recovered DNA fragments are amplified by PCR using partially degenerate
biotinylated primers and (3) digested with the restriction endonuclease NlaIII, which cuts
at 5'-CATG. Biotinylated fragments are isolated using streptavidin beads and split into
two pools. (4) Each pool is ligated to a different linker containing a recognition site for
MmeI, a type IIS enzyme, as well as a specific primer recognition sequence. (5)
Digestion with MmeI releases 21-base-pair tags that are anchored to NlaIII sites derived
from the original DNA fragments enriched after ChIP. (6) The two pools are ligated and
amplified by nested PCR using the specific primer sequences that were ligated. (7) The
ends are trimmed by NlaIII digestion to generate ditags. Ditags at steps 6 and 7 are gel
purified to ensure correct-sized fragments are used in subsequent steps. (8) Multiple
ditags are concatamerized, cloned and sequenced. Mapping each tag back to the
genome sequence can identify the DNA fragments that were present in the ChIP
sample and thus identify protein binding loci. STAGE is thus a novel high-throughput
experimental approach for identifying protein binding loci that does not rely on any
assumptions about the location and distribution of these binding sites on the genome. It
is therefore a useful method to identify functional elements in the genome. The
coverage or comprehensiveness of this method is limited in principle only by the depth
of sequencing, which is a reliable, widely available, automatable, inexpensive and
high-throughput procedure. The fact that sequencing is performed on concatamerized
(di)tags makes this a truly genomic method that can be quantitative. By analogy to
SAGE, the number of times one observes a given tag within a STAGE pool or library
should be directly proportional to the enrichment of that sequence in the original ChIP
DNA pool. Since it is not dependent on microarray resources (but can complement
ChIP-chip analyses), STAGE is applicable to any sequenced genome.
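
The tag-mapping idea described above can be illustrated with a minimal Python sketch. It
is not the software used in this project: the function names, the exact-match lookup and
the treatment of both strands are assumptions made for illustration. It indexes every
5'-CATG-anchored 21-mer in a set of sequences and then sorts observed tags into the
orphan, one-hit and multiple-hit categories used later in this report.

```python
from collections import Counter, defaultdict

TAG_LENGTH = 21   # tag length stated in the text, counted here to include CATG
ANCHOR = "CATG"   # NlaIII recognition site

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_genome_tags(chromosomes):
    """Map each CATG-anchored 21-mer to the places where it occurs.

    chromosomes: dict of name -> sequence (hypothetical input format).
    Minus-strand positions are offsets in the reverse-complemented sequence.
    """
    index = defaultdict(list)
    for name, seq in chromosomes.items():
        forward = seq.upper()
        for strand, s in (("+", forward), ("-", reverse_complement(forward))):
            start = s.find(ANCHOR)
            while start != -1:
                tag = s[start:start + TAG_LENGTH]
                if len(tag) == TAG_LENGTH:  # skip sites too close to the end
                    index[tag].append((name, strand, start))
                start = s.find(ANCHOR, start + 1)
    return index


def classify_tags(observed_tags, index):
    """Count each distinct tag and bin it by its number of exact genome matches."""
    result = {"orphan": {}, "one_hit": {}, "multi_hit": {}}
    for tag, count in Counter(observed_tags).items():
        hits = index.get(tag, [])
        if not hits:
            key = "orphan"
        elif len(hits) == 1:
            key = "one_hit"
        else:
            key = "multi_hit"
        result[key][tag] = count
    return result
```

Because the count attached to each tag is kept, the same structure supports the
quantitative use of tag frequency mentioned above. For a genome the size of the human
genome, an in-memory dictionary like this would likely need to be replaced by an
on-disk index.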

The original proposal was focused on identifying the targets of c-myc and estrogen
receptor (ER) in human MCF-7 cells. We initially encountered technical problems in
finding suitable antibodies that would reliably immunoprecipitate c-myc and ER from
cross-linked extracts. Our proposed STAGE method, like ChIP-chip, depends on the
immunoprecipitation of proteins of interest crosslinked to their target promoters. We
therefore decided to use E2F4 as a candidate protein to develop our method and
demonstrate its utility in identifying novel chromosomal targets of any protein of interest
in cancer cells. The proof-of-principle of the approach we proposed to develop during
this project has recently been published as an Article in Nature Methods [2]. We have
thus been successful in accomplishing the core tasks in our original Statement of Work,
with the only difference being that the method was applied to E2F4 rather than c-myc
and ER. This publication is attached along with this report in the Appendix and we refer
to it as "Kim et al. (2005)" in our accounting of the tasks below. E2F4 is a member of the
E2F family of transcription factors that are associated with cell proliferation and cancer,
and E2F4 has complex roles in breast cancer. It has been reported to be a tumor
suppressor gene [3] as well as an oncogene [4]. It is also reported to mediate the
transcription of ER-alpha (the focus of our Statement of Work) in breast cancer [5].

Below we describe our progress with regard to each of the specific tasks in the
Statement of Work, using E2F4 as an example transcription factor. Detailed
presentations of data are referenced in the attached publication. We have indicated how
they relate to each of the specific tasks and sub-tasks in the approved Statement of
Work. These details are not duplicated in the body of this report.

At  the  same  time  we  have  obtained  and 
successfully  used  antibodies  that  will  work  for 
immunoprecipitating  c-myc  and  ER  (Figure  1).  We 
also  tested  the  performance  of  anti-myc 
antibodies  for  chromatin  immunoprecipitation  and 
successfully  validated  the  binding  of  c-myc  to  its 
target  promoters  in  vivo  (Figure  2).  We  are 
currently  in  the  process  of  generating  STAGE 
libraries  for  large-scale  identification  of  target 
genes  for  c-myc  and  ER  using  the  same  strategy 
we  have  successfully  developed  using  E2F4  as  an 
example,  as  described  below. 

Figure 2 Binding of c-myc to its chromosomal targets in vivo compared to a negative
control. DNA enriched by chromatin immunoprecipitation (ChIP) is analyzed by PCR. PCR
primers were designed to binding sites in the promoter. Note the enrichment of promoter
loci from targets relative to the negative control, when comparing ChIP DNA to input
genomic DNA. (Gel image not reproduced; loci labeled on the gel: RAD54, SNRPA, EXO1.)

Task 1 Develop a novel genomic method termed Sequence Tag Analysis of Genomic
Enrichment (STAGE) to identify the direct transcriptional targets of c-myc and Estrogen
Receptor (ER) in human cells

We have optimized the experimental methods for successful application of STAGE in
mammalian cells (Task 1.a). Specifically, we introduced two modifications to the
STAGE method in order to achieve success in human cells. First, we implemented a
subtraction strategy to improve the enrichment of DNA associated with proteins relative
to the overall background. This method is described in Figure 3b of Kim et al. in the
Appendix. Table 1 below summarizes the overall number of STAGE tags we
sequenced, both from E2F4 targets in mammalian cells and from proof-of-principle
experiments in yeast. Secondly, we developed an analysis strategy that scored STAGE
tags based on their enrichment in the ChIP sample relative to unselected, normal
human genomic DNA. We generated a STAGE library from normal human genomic
DNA and profiled the background occurrence of tags to control for background in the
ChIP sample. This is described in more detail in Kim et al. in the Appendix.
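
To make the idea of scoring tags against a genomic background library concrete, the
following Python sketch computes a smoothed frequency ratio for each tag. This is an
illustrative assumption only: the scoring actually used is the one described in Kim et al.,
and the function name, the pseudocount and the ratio itself are hypothetical choices made
here to show the principle of comparing a ChIP library with an unselected library.

```python
from collections import Counter


def enrichment_scores(chip_tags, background_tags, pseudocount=1.0):
    """Score each ChIP tag by its smoothed frequency relative to background.

    chip_tags, background_tags: iterables of tag sequences, one per occurrence.
    A score above 1 means the tag is more frequent in the ChIP library.
    """
    chip = Counter(chip_tags)
    background = Counter(background_tags)
    chip_total = sum(chip.values())
    background_total = sum(background.values())
    if chip_total == 0 or background_total == 0:
        raise ValueError("both libraries must contain at least one tag")

    # The pseudocount keeps tags that are absent from the background library
    # from producing a division by zero; it is added once per distinct tag.
    n_distinct = len(set(chip) | set(background))
    scores = {}
    for tag, count in chip.items():
        chip_freq = (count + pseudocount) / (chip_total + pseudocount * n_distinct)
        bg_freq = (background[tag] + pseudocount) / (
            background_total + pseudocount * n_distinct
        )
        scores[tag] = chip_freq / bg_freq
    return scores
```

Dividing by library-wide totals puts the two libraries on a common scale even when they
were sequenced to different depths.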
Table 1: Summary of STAGE tags sequenced

For each library, the first row shows the total number of tags in each category and the
second row shows the number of distinct tags after duplicates etc. have been removed.
A dash marks a value that is not given.

Library                 Row       Total   Orphan      Positive  'One-hit'  Multiple-hit
                                  tags    tags a      tags b    tags c     (ambiguous) tags
E2F4 STAGE              total     2830    -           2405      1386       1019
                        distinct  2256    -           1918      1279       639
E2F4 SubSTAGE           total     1696    607 (36%)   1089      742        347
                        distinct  1253    -           896       646        250
Genomic background d    total     7990    1165 (15%)  6825      3277       3548
                        distinct  5523    -           4707      3007       1700
Yeast TBP STAGE         total     1344    294 (22%)   1050      617        433
                        distinct  763     -           631       528        103
Yeast HSF STAGE e       total     320     56 (17%)    264       168        96
                        distinct  259     -           209       155        54

a - The number of tags that do not match any sequence in the genome. The overall
proportion of orphan tags is comparable to that seen in SAGE.
b - The number of tags that show a perfect match to genomic sequence.
c - The number of tags that show only one match to a single position on the genome.
d - A subset of the genomic background tags, approximately of the same size as the
E2F4 STAGE tags, was used to normalize STAGE scores as described in the paper.
e - Analysis of yeast HSF STAGE tags was not included in the published version of Kim
et al.
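
The categories in Table 1 are related by simple arithmetic: orphan tags plus positive tags
give the total, and one-hit plus multiple-hit tags give the positive tags. The short Python
check below, offered as a sketch, applies these relations to the "total" rows for which an
orphan count is listed. The E2F4 STAGE row is left out because no orphan count is given
for it, and the key used for the 7990-tag row is an inferred label (footnote d), not one
taken from the table itself.

```python
# (total, orphan, reported orphan %, positive, one-hit, multiple-hit)
TABLE_1_TOTAL_ROWS = {
    "E2F4 SubSTAGE": (1696, 607, 36, 1089, 742, 347),
    "Genomic background (label inferred)": (7990, 1165, 15, 6825, 3277, 3548),
    "Yeast TBP STAGE": (1344, 294, 22, 1050, 617, 433),
    "Yeast HSF STAGE": (320, 56, 17, 264, 168, 96),
}


def check_row(total, orphan, reported_percent, positive, one_hit, multi_hit):
    """Return the consistency checks for one 'total' row of Table 1."""
    orphan_percent = 100.0 * orphan / total
    return {
        "orphan_plus_positive_is_total": orphan + positive == total,
        "hits_sum_to_positive": one_hit + multi_hit == positive,
        "orphan_percent": orphan_percent,
        # Allow for rounding or truncation to a whole percent in the table.
        "percent_matches_table": abs(orphan_percent - reported_percent) <= 0.5,
    }


if __name__ == "__main__":
    for library, row in TABLE_1_TOTAL_ROWS.items():
        print(library, check_row(*row))
```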

Progress for Task 1.b, which was the application of STAGE to c-myc and ER, is
described above. Tasks 1.b and 1.c involved the identification of targets of c-myc and
ER in different cells and at different time intervals after stimulation. We have
successfully carried out chromatin immunoprecipitation for ER in MCF-7 cells in the
presence and absence of estradiol. In parallel we have developed human core promoter
microarrays in the lab based on a published primer set [6], which enables the
identification of target promoters by ChIP-chip. These chromatin IP samples for ER and
the ones currently in progress for c-myc will be analyzed by STAGE.

Task 2 Validation, analysis and interpretation of direct targets of c-myc and ER
identified by STAGE

We have successfully validated target promoters predicted by STAGE (Task 2.a) both by
promoter-specific PCR and by ChIP-chip. These data are shown for E2F4 in Figures 3c
and 4 of Kim et al. in the Appendix. We have carried out a preliminary analysis of DNA
motifs present in the promoters of target genes (Task 2.b), and these data are in Table 2
of Kim et al.

We have also analyzed c-myc and ER target genes by ChIP-chip using a core promoter
microarray. These target genes will serve to cross-validate the predicted targets as
identified by STAGE under Task 2.a. We have developed computational algorithms to
enable the mapping of mammalian STAGE tags to the genome and to score tags as
targets (Task 2.b and Task 2.c). We have begun developing a Java-based relational
database termed ArrayPlex that is capable of storing data on DNA-protein interactions
and of facilitating the mining of such data to identify transcriptional regulatory networks
(Figure 3). Although ArrayPlex was originally intended to store ChIP-chip data from
microarrays, it is flexible enough to accept any list of target promoters for a transcription
factor, such as that generated by STAGE. We are currently developing analysis routines
within the extensible ArrayPlex framework that will facilitate data mining of STAGE data
and elucidate transcriptional regulatory networks mediated by oncogenic transcription
factors.

Figure 3 ArrayPlex screenshot. ArrayPlex is an open-source Java-based client-server
application we have developed to store the data on target promoters. It was designed to
store data from ChIP-chip experiments, but it is capable of storing similar data entered
as simple text files from any experimental strategy such as STAGE. (Screenshot not
reproduced; it shows the ArrayPlex authentication window, credited to Patrick Killion,
Iyer lab, University of Texas at Austin.)

KEY RESEARCH ACCOMPLISHMENTS


■  Developed  STAGE  (Sequence  Tag  Analysis  of  Genomic  Enrichment),  a  novel 
method  to  identify  the  chromosomal  targets  of  transcription  factors  including 
oncogenic  transcription  factors.  We  have  demonstrated  proof-of-concept  with  the 
transcription  factor  E2F4. 

■  Optimized  protocols  and  computational  analysis  methods  for  successful  application 
of  STAGE  in  mammalian  cells. 

■  Developed  software  for  matching  STAGE  derived  tags  to  the  human  genome 
sequence  and  scoring  tags  as  representative  of  target  promoters. 

■  Verified  target  promoters  predicted  by  STAGE  through  independent  means  such  as 
promoter  specific  PCR  and  microarray  hybridization  (ChIP-chip). 

■  Began  development  of  ArrayPlex,  a  Java  based  relational  database  and  analysis 
platform  for  data  mining  of  large-scale  target  promoter  data. 


REPORTABLE OUTCOMES

■ Published our proof-of-concept of successful development of STAGE, a method to
identify targets of oncogenic transcription factors in human cells.

Kim J., Bhinge A.A., Morgan X.C. & Iyer V.R. (2005) Mapping DNA-Protein
Interactions in Large Genomes by Sequence Tag Analysis of Genomic Enrichment.
Nature Methods 2: 47-53

■  Platform  Presentation:  "Genome-wide  mapping  of  DNA  -protein  interactions  in  large 
genomes  by  STAGE  —  Sequence  Tag  Analysis  of  Genomic  Enrichment"  at  the  Cold 
Spring  Harbor  meeting  on  "Systems  Biology:  Genomic  Approaches  to 
Transcriptional  Regulation",  March  2004 

■  Platform  Presentation  at  the  American  Society  of  Microbiology  (Texas)  meeting, 
Houston,  November  2004 

■ Grant award R01 HG003532-01, V. Iyer (PI), from NIH/NHGRI: "STAGE and FAIRE
for Regulatory Element Identification". This is a technology development project
funded under the ENCODE Consortium. It is a collaboration between my lab and the
lab of Dr. Jason Lieb at the University of North Carolina at Chapel Hill. The objective is
to develop and combine STAGE with another novel technology for identifying protein
binding sites and regulatory elements in the human genome.

CONCLUSIONS


Our  proof-of-concept  of  the  development  of  STAGE  (Kim  et  al)  illustrates  that  it  is 
possible  to  identify  the  chromosomal  targets  of  a  sequence  specific  transcription  factor 
in  human  cells.  Our  results  so  far  suggest  that  it  should  be  possible  to  use  STAGE  to 
identify  the  chromosomal  targets  of  any  DNA  binding  protein,  including  oncogenic 
transcription  factors  that  are  important  in  breast  and  other  cancers.  Identification  of 
direct  chromosomal  targets  of  transcription  factors  will  help  elucidate  the  transcriptional 
regulatory  networks  mediated  by  oncogenes  during  cell  proliferation.  The  significance  of 
STAGE  is  that  it  is  a  genome-wide  method  that  overcomes  many  of  the  limitations  of 
ChIP-chip  using  promoter  arrays.  Truly  comprehensive  promoter  microarrays  that  cover 
all  possible  regulatory  regions  from  the  human  genome  are  not  yet  widely  available. 
However, we have yet to characterize the true extent of coverage of targets by STAGE.
By analogy to SAGE, we expect that the greater the depth of sequencing of STAGE
tags, the better the coverage of target genes.

Identifying the chromosomal targets of transcription factors is important for
reconstructing the transcriptional regulatory networks underlying global gene expression
programs. We have developed an unbiased genomic method called sequence tag
analysis of genomic enrichment (STAGE) to identify the direct binding targets of
transcription factors in vivo. STAGE is based on high-throughput sequencing of
concatemerized tags derived from target DNA enriched by chromatin
immunoprecipitation. We first used STAGE in yeast to confirm that RNA polymerase III
genes are the most prominent targets of the TATA-box binding protein. We optimized
the STAGE protocol and developed analysis methods to allow the identification of
transcription factor targets in human cells. We used STAGE to identify several
previously unknown binding targets of human transcription factor E2F4 that we
independently validated by promoter-specific PCR and microarray hybridization. STAGE
provides a means of identifying the chromosomal targets of DNA-associated proteins in
any sequenced genome.

Determining the binding sites of regulatory proteins on the genome is important for
reconstructing transcriptional regulatory networks (refs. 1-3). The binding of a
transcription factor to its genomic targets can be assayed by combining chromatin
immunoprecipitation (ChIP) and microarray (chip) hybridization. This ChIP-chip method
was first developed for yeast (ref. 4), where it has been used to define the targets of
more than 100 transcription factors (refs. 2,5,6). Although ChIP-chip has also enabled
the identification of transcription factor targets in human cells (refs. 7,8), it is
challenging to apply this approach comprehensively to study large and complex
genomes. Human promoter microarrays based on core promoters (ref. 7) or CpG islands
(ref. 8) cover a subset of all potential regulatory regions and may not adequately
represent regions that are distant from genes or within introns. Tiling arrays of
polymerase chain reaction (PCR) products (ref. 9) or oligonucleotides (ref. 10) have
been made for the smallest human chromosomes, but extending such arrays to cover
the entire genome is expensive, and the arrays are currently unavailable to most
researchers. Although these efforts are underway for the human genome and some
model organisms, the development of similar platforms for the mouse, plants,
prokaryotes and many other model organisms is lagging.

Here, we address some of these limitations by developing an unbiased genomic method
to identify the chromosomal targets of transcription factors. We term this method
STAGE, and it is based on high-throughput sequencing of concatemerized tags derived
from DNA enriched by ChIP. Cloning and sequencing of ChIP DNA has been carried out
previously (ref. 11), but these efforts did not constitute a high-throughput genomic
approach. As a demonstration of its utility, we first used STAGE to map the targets of
TATA-box binding protein (TBP) in yeast. We then optimized STAGE and developed
analysis algorithms that enabled us to successfully use STAGE to identify several known
and new binding targets of transcription factor E2F4 in human cells.

RESULTS

STAGE identifies chromosomal targets in yeast

STAGE is conceptually derived from serial analysis of gene expression (SAGE)
(refs. 12,13), but the template for STAGE consists of genomic loci enriched by ChIP.
Briefly, transcription factors are cross-linked to their target sites in vivo with
formaldehyde. After ChIP with a specific antibody against a given transcription factor,
the recovered DNA fragments are amplified by PCR using biotinylated degenerate
primers and digested with the four-base cutter (5'-CATG) restriction endonuclease
NlaIII. The biotinylated fragments are isolated using streptavidin beads and ligated to
linkers containing a recognition site for MmeI, a type IIS restriction enzyme. Digestion
with MmeI releases 21-base-pair (bp) tags containing NlaIII sites from DNA fragments
enriched after ChIP. Multiple tags are concatemerized, cloned and sequenced. STAGE
generates 21-bp tags derived from ChIP DNA (Fig. 1). Mapping these tags to the
genome can identify the loci represented in the ChIP sample and thus identify
protein-binding locations.

We first used STAGE to identify the targets of yeast TATA-box binding protein (TBP).
Out of a total of 1,344 sequenced tags, 294 (22%) did not match any sequence in the
yeast genome. The total number of sequenced tags and the number of orphan and
ambiguous tags are provided in Supplementary Table 1 online. Out of 1,050 valid
STAGE tags, 433 showed multiple hits on the genome and could not be assigned to a
single gene; 77 tags had single hits but had no annotated genes within one kilobase
(kb). The remainder comprised 437 distinct tags, each of which had only one hit on the
yeast genome and was located within 1 kb of the start of a gene.

Figure 1 | The STAGE strategy. STAGE is based on high-throughput sequencing
of concatemerized tags of defined length that are derived from DNA enriched
by ChIP. Proteins were cross-linked to their binding sites in vivo with
formaldehyde and chromatin was extracted and sheared. The cross-linked
protein-DNA complexes were immunoprecipitated, cross-links were reversed
and ChIP DNA was amplified by PCR using biotinylated primers. Amplified DNA
fragments were digested with NlaIII, which cuts at 5'-CATG sites. Fragments
with ends containing the NlaIII site were isolated by binding to streptavidin
beads. They were separately ligated to one of two linkers containing a
MmeI site, then incubated with MmeI, which cleaves 21 bp away from its
recognition site. The 21-bp tags attached to linkers were isolated and ligated
to create ditags. Ditags were amplified by PCR using nested primers and
trimmed by digesting with NlaIII. Trimmed ditags were gel purified,
concatemerized by ligation, cloned and sequenced.

Of the 437 tags, 378 occurred only once in the STAGE pool and
59 occurred multiple times. Seventy-nine putative targets were
represented by more than one tag occurrence. The one notable
feature of the abundant tags was that a substantial majority mapped
within 1 kb of an RNA polymerase III (pol III) promoter. Based on
this observation and on the fact that pol III promoters are
prominent targets of TBP (refs. 14,15), we assigned the gene with a pol III
promoter as the putative target when a tag mapped near it. In other
cases, the nearest gene was assigned as the putative target. Tags that
occurred multiple times in the STAGE pool, as well as their putative
targets, are listed (Table 1). Sixty-eight of 79 targets represented by
multiple tags were genes with an RNA pol III promoter. STAGE thus
identified many prominent chromosomal targets of TBP in yeast.
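
The target-assignment rule described in this paragraph (prefer a gene with a pol III
promoter within 1 kb, otherwise take the nearest gene) can be written down as a short
Python sketch. The data layout, the function name and the use of a single start
coordinate per gene are assumptions made for illustration; they are not taken from the
analysis software used for the paper.

```python
from collections import Counter


def assign_targets(tag_hits, genes, window=1000):
    """Assign each uniquely mapped tag occurrence to a putative target gene.

    tag_hits: list of (tag, chromosome, position), one entry per tag occurrence.
    genes: list of dicts with keys "name", "chrom", "start" and "pol3" (bool).
    Returns a Counter of tag occurrences per assigned gene.
    """
    occurrences = Counter()
    for _tag, chrom, pos in tag_hits:
        nearby = [
            g for g in genes
            if g["chrom"] == chrom and abs(g["start"] - pos) <= window
        ]
        if not nearby:
            continue  # no annotated gene within the window: tag is not assigned
        # A gene with a pol III promoter takes priority over other nearby genes.
        pol3_genes = [g for g in nearby if g["pol3"]]
        candidates = pol3_genes or nearby
        target = min(candidates, key=lambda g: abs(g["start"] - pos))
        occurrences[target["name"]] += 1
    return occurrences


def multiply_supported(occurrences, minimum=2):
    """Return the genes supported by at least `minimum` tag occurrences."""
    return {gene for gene, n in occurrences.items() if n >= minimum}
```

Counting occurrences per gene rather than per tag is what allows one gene to be
supported by several different tags, as in the 79 targets mentioned above.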


Validation of STAGE targets by microarray hybridization

To compare our STAGE targets to those identified by microarray
hybridization, ChIP DNA samples were fluorescently labeled and
cohybridized to whole-genome (ORFs + intergenic regions) microarrays
with an amplified genomic DNA reference. The occupancy
of each promoter by TBP was indicated by the rank of its
enrichment in ChIP relative to the reference (ref. 16).

STAGE identified increasing numbers of genes as TBP targets
with increasing enrichment in ChIP as measured by microarrays
(Fig. 2a). This relationship was more pronounced when we considered
only genes that were identified as targets by more than one
tag occurrence (Fig. 2a). Among the putative TBP targets represented
by at least two tag occurrences, 92% had high enrichment
values (>90) in ChIP-chip. When the two ChIP samples were
independently generated, 91% of the targets predicted by at least
two STAGE tag occurrences showed high ChIP-chip enrichment
values (Fig. 2b). Thus, identification of chromosomal targets by
STAGE correlates well with that by ChIP-chip, especially when the
target genes were designated by multiple occurrences of STAGE tags.

Table 1 | High-abundance yeast TBP STAGE tags

Tag sequence             nocc  Target gene
CATGATGGAAACGAAGACGAC    10    tF(GAA)B
CATGAGAATGTGCTTCAGTAT    8     tF(GAA)B
CATGAAGGTGACAAAATGATT    5     tK(CUU)E1
CATGATCAAATTCTGTGAAGC    5     tL(CAA)A
CATGCAAATCTAAATAAAAAC    5     tH(GUG)H
CATGTATACTTAACAGATATG    5     RDN5-1
CATGAGATATGCTGTTTCAAG    4     tL(CAA)A
CATGTATATATTGCACTGGCT    4     RDN5-1
CATGAAACTAGGAAAACGTAC    3     tE(UUC)J
CATGAAGATGATTCGATACCG    3     tV(AAC)M1
CATGATGAAGTTTAGATCTGC    3     tW(CCA)K
CATGATGGCAGACTTCCATCG    3     tV(AAC)G2
CATGATGTCGCTATTTCTAAT    3     tY(GUA)J2
CATGCAAGATGTAGACCCAAC    3     YGRWct5
CATGCAATCCCAGTAGTAGGT    3     SCR1
CATGCAGCTGTTGTATCAAGA    3     tV(AAC)G1
CATGCATGTTTTACGTTGTGG    3     tP(AGG)N
CATGGAATGTGCAATTAAGAC    3     tT(AGU)N2
CATGTGGTGTAAAAAGATAAC    3     tT(AGU)J
CATGTTATCCTGAGCATCCAC    3     tG(GCC)O2
CATGTTTACCCTCAAACAAAG    3     tV(AAC)K1
CATGTTTCCTCTAAAGATGGT    3     tR(UCU)B
CATGAAAACCTCTCAAACCTT    2     tH(GUG)E1
CATGAAAAGGTTTAATGACTT    2     tT(AGU)O1
CATGAAGACCTATTCGCTTAT    2     tV(AAC)G3
CATGAAGCGCACAAGATTGGA    2     tR(UCU)G3
CATGAATGGCGCAGATTTATT    2     tV(AAC)M1
CATGAGGCGCACTTTTGATTT    2     tY(GUA)F2
CATGAGTTGCCATTAGAAACG    2     tW(CCA)G1
CATGATACTGACTTATTGGGC    2     tD(GUC)L1
CATGCAAGACGTAGACCCAAC    2     tI(AAU)I2
CATGCAAGTGTGGCATAAAAG    2     tK(CUU)E2
CATGCAGAAAAGATAAGATGC    2     YPL029W
CATGCCTGTGCAACGCCGCAG    2     tE(UUC)J
CATGCTCGGCAATAGCTTCAA    2     tG(CCC)D
CATGli I IGTCT ICCTGTTAG 2     tP(UGG)O2
CATGGAAAAACGAATGGAGAC    2     tA(AGC)K1
CATGGAAATCGAACCTTTCAC    2     tN(GUU)N2
CATGGAGTCTAACTTTGTTGT    2     tN(GUU)O2
CATGGAGTCTTTTATTTCCGA    2     tN(GUU)L
CATGGCAAAAACTGTAAAGTT    2     tR(UCU)G2
CATGGCGAATTTTTCACATAT    2     tV(UAC)D
CATGGCGATTATTTCATTATG    2     tR(UCU)G3
CATGGCTAGTCAAATAAGTGG    2     YGL080W
CATGGGGTAAGTTCCGATGGC    2     tV(AAC)E2
CATGGGTTCAAACACTTCCAA    2     tY(GUA)F1
CATGGTGAAAGTTTAATCTTT    2     tR(ACG)K
CATGTAAACCATCCCTTTTCA    2     YJL005W
CATGTATAAAACCTACCGCTT    2     tS(CGA)C
CATGTATCAAATTGCACGTGA    2     YPRC522
CATGTATGAAACTGGGAATTC    2     tS(AGA)B
CATGTCAATGTCCATTTCTTT    2     tT(AGU)I2
CATGTCTTTTGTGGATTATTT    2     tS(CGA)C
CATGTGAGGCTTAGGTGATTT    2     tN(GUU)N2
CATGTGTTTGAATTAGCGATC    2     tL(CAA)A
CATGTTACAATTCCTTTCCAT    2     tG(UCC)G
CATGTTATGTTCAATTGGCAG    2     YELCtI
CATGTTCAAGGACGGCTTGGT    2     tD(GUC)J1
CATGTTTTCGTTATTTCATAA    2     tR(UCU)B

Tags that occurred more than once are listed, including the 4-bp NlaIII site (5'-CATG). The
number of times the tag occurred in the STAGE pool is indicated by nocc. Target genes were
designated as described in the text.

Figure 2 | Correlation between yeast targets predicted by STAGE and ChIP-
chip. The enrichment value of yeast TBP targets after ChIP was determined
by microarray hybridization. The percentile rank (0-100) of the ratio of
ChIP-enriched fragments to genomic DNA was used to determine the ChIP
enrichment value for each locus. For each interval of TBP ChIP enrichment
values plotted on the x-axis, the number of targets predicted by STAGE is
plotted on the y-axis. (a) Comparison between STAGE and ChIP-chip when
the same sample was analyzed by both methods. The gray line indicates all
predicted STAGE targets, whereas the black line indicates only the subset
of 79 target genes predicted by multiple tag occurrences. (b) Comparison
between STAGE and ChIP-chip when different ChIP samples were analyzed.


STAGE  in  human  cells 

We  chose  transcription  factor  E2F4  to  test  STAGE  in  human  cells. 
E2F4  is  a  member  of  the  E2F  family  of  transcriptional  regulators 
that functions as a repressor in quiescent and early G1 cells17. We
first  used  ChIP  and  promoter-specific  PCR  to  verify  the  binding  of 
E2F4  to  known  target  promoters7  (Fig.  3a).  We  then  constructed  a 
human  E2F4  STAGE  pool  from  these  validated  ChIP  samples. 

To  reduce  and  account  for  background  genomic  DNA  in  ChIP, 
we  introduced  two  enhancements.  First,  we  tested  a  subtraction 
step  as  a  potential  means  of  reducing  background  from  nonspecific 
genomic  loci.  Briefly,  DNA  fragments  enriched  by  ChIP  were 
randomly  amplified  by  PCR  with  degenerate  primers,  and,  in 
parallel,  sheared  genomic  DNA  fragments  were  amplified  using 
biotinylated  degenerate  primers.  ChIP  DNA  was  hybridized  to  an 
excess of biotinylated genomic DNA and biotin-containing heteroduplexes
were removed by binding to streptavidin beads. The
remaining DNA was used as the input for STAGE. Details of the
subtraction  procedure  are  given  in  Supplementary  Methods 
online.  In  a  ChIP  sample  where  the  enrichment  of  an  E2F4  target 




Figure 3 | ChIP of E2F4 targets and validation of STAGE targets by ChIP-PCR.
(a) Binding of human E2F4 to known target promoters in fibroblasts. PCR was
performed using primers corresponding to the promoters of the indicated
genes. The ninth exon of CCNB1 was used as a negative control for ChIP
enrichment (NC1). (b) The subtraction procedure leads to improved
enrichment of the RAD54L promoter in ChIP. 'M' is a size ladder. (c) Validation
of STAGE targets by ChIP-PCR. A subset of 18 promoters out of the 45
predicted by STAGE were randomly chosen. E2F4 binding to the promoters of
the indicated genes was assayed by promoter-specific PCR. NC1 is the ninth
exon of CCNB1 and NC2 is the promoter of ACTB; both are negative controls.
SNRPD2 and QPCTL are divergently transcribed. The putative targets of E2F4
predicted by SubSTAGE are marked by an asterisk.


over  background  was  originally  suboptimal,  we  observed  improved 
enrichment  after  subtraction  (Fig.  3b).  Tags  from  this  E2F4 
subtraction  STAGE  (SubSTAGE)  pool  were  combined  for  analysis 
with  STAGE  tags  obtained  without  the  subtraction  step. 

Additionally,  we  performed  STAGE  on  normal,  unselected 
human  genomic  DNA  to  profile  tags  arising  from  background 
genomic  DNA  that  was  not  enriched  by  ChIP.  This  background 
STAGE  pool  would  thus  serve  as  an  analysis  control  to  account  for 
sampling  of  STAGE  tags  from  highly  repetitive  regions  of  the 
genome.  We  analyzed  approximately  3,500  valid  tags  to  identify 
targets  of  E2F4  in  human  cells. 

Targets  of  human  transcription  factor  E2F4 

To overcome the ambiguity inherent in mapping many 21-bp tags
to specific locations on the human genome, we developed an
algorithm to score tags and genes as putative targets. Each distinct
tag was assigned a tag score based on the number of its hits on the
genome and the number of its occurrences in the STAGE pool.
Details of the scoring method are described (see Methods and
Supplementary Methods online). A higher number of hits on the
genome lowered the tag score, and a higher occurrence number in
the STAGE pool raised the tag score. For each human gene in
RefSeq18,19, a final STAGE enrichment score was generated that was
indicative of the enrichment of its promoter in ChIP. The final
STAGE enrichment score for each gene was calculated by dividing
its raw score from the ChIP STAGE library by its raw score from the
appropriate background genomic STAGE library.

There were 48 putative targets of E2F4 with STAGE enrichment
scores greater than a threshold of 900 in either of the two STAGE
pools (Table 2). Raw scores and final STAGE enrichment scores are
available (Supplementary Table 2 online). Most targets were
designated by at least one tag with a single hit on the human
genome. In addition to previously known targets of E2F4 such as
RAD54L, SLC3A2 and MAP3K7, which had been identified using a
human core promoter microarray7, our analysis identified several
new targets that had not been identified in previous studies. We also

Table 2 | Human E2F4 targets predicted by STAGE

No.  Gene        Gene score  E2F4 site  Description
1    MTUS1       1971                   Mitochondrial tumor suppressor gene 1
2    ULBP3       1961                   UL16 binding protein 3
3    SNRPD2*     1933                   Small nuclear ribonucleoprotein D2 polypeptide, 16.5 kDa
     QPCTL*      1923                   Hypothetical protein FLJ20084
4    PXK         1015                   PX domain-containing serine/threonine kinase
5    FLJ22353    993                    Hypothetical protein FLJ22353
6    GAJ         993         Yes        GAJ protein
7    ACR         992                    Acrosin
8    RAD54L      992                    RAD54-like (S. cerevisiae)
9    AAMP        982                    Angio-associated migratory cell protein
10   ABHD2       982                    Abhydrolase domain-containing 2
11   BLVRB*      982                    Biliverdin reductase B (flavin reductase (NADPH))
     SPTBN4*     982                    Spectrin, beta, non-erythrocytic 4
12   DC2         982                    DC2 protein
13   FLJ13912    982         Yes        Hypothetical protein FLJ13912
14   FLJ25416    982                    Hypothetical protein FLJ25416
15   FLJ32000    982         Yes        Hypothetical protein FLJ32000
16   FLJ90834    982                    Hypothetical protein FLJ90834
17   MPV17       982         Yes        MpV17 transgene, murine homolog, glomerulosclerosis
18   PRCP        982                    Prolylcarboxypeptidase (angiotensinase C)
19   PSMA4       982                    Proteasome (prosome, macropain) subunit, alpha type, 4
20   RNF29       982                    Ring finger protein 29
21   TOPK        982         Yes        T-LAK cell-originated protein kinase
22   DRF1        974                    Dbf4-related factor 1
23   LMO7        971                    LIM domain only 7
24   SLC3A2      971         Yes        Solute carrier family 3 (activators of dibasic and neutral amino acid transport), member 2
25   SOAT2       971                    Sterol O-acyltransferase 2
26   ARHGAP11A   965         Yes        KIAA0013 gene product
27   ABC1        961                    Amplified in breast cancer 1
28   BTRC        961                    Beta-transducin repeat containing
29   GAL3ST1     961                    Cerebroside (3'-phosphoadenylylsulfate:galactosylceramide 3') sulfotransferase
30   CSTF3       961                    Cleavage stimulation factor, 3' pre-RNA, subunit 3, 77 kDa
31   CTAG3*      961                    Cancer/testis antigen 3
     RIOK1*      961                    RIO kinase 1 (yeast)
32   DNALI1      961                    Dynein, axonemal, light intermediate polypeptide 1
33   EPHA3       961                    EPH receptor A3
34   FIBL-6      961         Yes        Hemicentin
35   FLJ20712    961                    Hypothetical protein FLJ20712
36   HIST2H2AC   961                    Histone 2, H2ac
37   HOXA3       961                    Homeobox A3
38   JPH2        961                    Junctophilin 2
39   MAP3K7      961         Yes        Mitogen-activated protein kinase kinase kinase 7
40   METAP2      961         Yes        Methionyl aminopeptidase 2
41   PDGFA       961                    Platelet-derived growth factor alpha polypeptide
42   RPL23A      961         Yes        Ribosomal protein L23a
43   SNIP1       961         Yes        Smad nuclear interacting protein
44   CCRL2       926                    Chemokine (C-C motif) receptor-like 2
45   C20orf141   913                    Chromosome 20 open reading frame 141

Figure 4 | Validation by ChIP-chip of E2F4 targets
predicted by STAGE. DNA from an E2F4 ChIP was
amplified and labeled with Cy5, and hybridized to
a human core-promoter microarray together with
a mock IP sample labeled with Cy3. The ratio of
Cy5/Cy3 (red/green) signal is an indicator of
the binding of E2F4 to the locus at a given spot.
(a) New targets identified by STAGE (see Table 2
and Fig. 3c) include SNRPD2, QPCTL, DRF1,
ARHGAP11A, TOPK, CSTF3 and PSMA4. Previously
known E2F4 targets that were also identified by
STAGE are SLC3A2, RAD54L and MAP3K7. The ACTB
promoter is a negative control. (b) Correlation
between targets predicted by STAGE and ChIP-chip.
The average percentile rank (0-100), from
two microarray hybridizations, of the ratio of ChIP-enriched fragments to mock IP control DNA was determined for each spot on the microarray. For each interval
of E2F4 ChIP enrichment values plotted on the x-axis, the number of targets predicted by STAGE (total 26) is plotted on the y-axis. Ten STAGE-predicted targets
rank in the top 5% of all spots on the microarray, corresponding to a red/green ratio > 2.0.


calculated a significance value for each STAGE enrichment score.
All our putative targets (Table 2) had scores with P values much
lower than 0.01. The score for the ACTB (β-actin) gene used as a
negative control had a much higher P value (P > 0.5).

Validation of STAGE in human cells

From the 45 putative target promoters (Table 2), we selected 18 for
validation by promoter-specific PCR. Primers were designed to
assay a region spanning the ~400 bp upstream of the transcription
start site of each gene. We detected E2F4 binding to 15 promoters
(Fig. 3c). Including RAD54L (Fig. 3a), we could thus independently
verify 16 of 19 (84%) binding targets predicted by STAGE.

We used ChIP-chip to further verify the binding of E2F4 to
promoters identified by STAGE. DNA from an independent E2F4
ChIP was amplified and labeled with Cy5 and hybridized to a
9,500-element human core promoter microarray20 together with a
mock-immunoprecipitated sample labeled with Cy3. Many previously
unknown E2F4 targets that we identified by STAGE were
indeed enriched in the independent ChIP-chip as indicated by high
red/green (Cy5/Cy3) ratios (Fig. 4a). STAGE identified increasing
numbers of genes as E2F4 targets with increasing enrichment in
ChIP-chip (Fig. 4b). Of the 48 E2F4 target genes identified by
STAGE, 26 were represented on the microarray. Ten of these (38%)
had ChIP-chip enrichment values in the top 5%, indicating they
were bona fide targets. The overlap between the targets identified by
STAGE and by ChIP-chip, although modest, was highly significant
(P < 10⁻⁷ based on sampling permutation), showing that STAGE
enables the identification of target loci in human cells. This overlap
between the targets identified by the two different technologies is
comparable to the 43% overlap we observed between our ChIP-chip
targets and the set of E2F4 targets previously reported in the
literature also using ChIP-chip7,8.

In addition to the identification of E2F4 targets based on the
occurrence of tags within a 3-kb window proximal to annotated
genes, we separately scored genes as putative targets based on the
presence of tags within a region from -10 kb to -6 kb or from -6 kb
to -2 kb relative to the start of transcription or within the first
intron. These analyses identified 48, 43 and 17 additional putative
targets, respectively (Supplementary Tables 3, 4 and 5). Some of
these additional putative targets, such as ACR, FLJ22353 and
ULBP3, had also been identified in our analysis based on the
3-kb proximal region (Table 2). It is possible that E2F4 binds to
multiple sites at varying distances upstream of some of its target
genes. Approximately 1,400 unique STAGE tags were derived from
regions of the genome that were not within 10 kb upstream of, or
in, the first intron of any gene. Although we have not validated
these as true E2F4 binding sites, binding to sites outside promoters
would be consistent with recent reports describing such binding by
NF-κB9, c-myc and Sp1 (ref. 10).

DISCUSSION

Our results demonstrate the utility of STAGE as an unbiased genomic
method for identifying the chromosomal binding targets
of proteins. STAGE identified many new target genes of E2F4
in human fibroblasts that had not been identified in previous
studies using targeted core promoter microarrays or CpG
island microarrays7,8.

The fraction of orphan STAGE tags that did not match any
genomic sequence was generally 15-19%, similar to what has been
observed for SAGE13,21. Orphan tags likely arise from a combination
of PCR and sequencing errors and cross-contamination from
unrelated DNA samples. Half of the 22% orphan tags we observed
in one instance in yeast consisted of repeated occurrences of just
two distinct tags. We did not observe these two tags in any other
STAGE pools. Although it is desirable to minimize the occurrence
of such orphan tags, they do not present a problem for STAGE, as
they are excluded from analysis.

Although there was significant overlap (P < 10⁻⁷) between the
E2F4 targets that we identified by ChIP-chip and by STAGE, the
agreement between the two technologies was not perfect. ChIP-chip
involves a complex hybridization step and can be affected by
the presence of repetitive DNA, poor PCR product in the microarray
spot, differential amplification of ChIP DNA during fluorescent
labeling and hence low sensitivity or specificity at certain
loci. For example, we identified PSMA4 as an E2F4 target by STAGE
and validated it by ChIP-PCR, but it showed only marginal
enrichment in ChIP-chip (Fig. 4a). However, MAP3K7, a previously
known target of E2F4 that we also identified by STAGE,
likewise did not show enrichment in our ChIP-chip, indicating that
ChIP-chip is not infallible. For this reason, we believe that the
standard low-throughput ChIP-PCR assay is a more reliable
measure of whether a locus is a true binding target.
Based on our ChIP-PCR analysis (Fig. 3a,c), we estimate the true
positive rate of STAGE in human cells is ~84% in our experiments.
This success rate can potentially be improved by enhancements to
the analysis algorithms as well as improvements to the ChIP
procedure to reduce nonspecific DNA background. Subtraction is
one potential means of reducing background. However, it is
possible that the subtraction step may be effective only when the
initial ChIP enrichment is poor (Fig. 3b). The use of new type III
restriction enzymes generating 26-bp tags rather than 21-bp tags
may also improve the specificity of STAGE22. However, 70% of all
NlaIII-anchored 21-bp tags in the human genome were unique,
whereas 76% of all such 26-bp tags were unique. The improvement
in the ability to uniquely localize tags by increasing their lengths
from 21 to 26 bp is therefore not likely to be dramatic.
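
As an illustration of how such a uniqueness estimate could be obtained, the following minimal Python sketch counts anchored tags in a sequence and reports the fraction that occur exactly once. It is not the analysis code used in this study. The function names are hypothetical, only the forward strand is scanned, and "unique" is assumed to mean that the tag sequence occurs at a single anchored position.

```python
from collections import Counter

ANCHOR = "CATG"  # recognition site of NlaIII, the anchoring enzyme


def anchored_tags(seq: str, tag_len: int) -> list[str]:
    """Return every tag of length tag_len that starts at an anchor site.

    Simplification (assumption): only the forward strand is scanned.
    """
    seq = seq.upper()
    tags = []
    start = seq.find(ANCHOR)
    while start != -1:
        tag = seq[start:start + tag_len]
        if len(tag) == tag_len:  # drop tags truncated by the sequence end
            tags.append(tag)
        start = seq.find(ANCHOR, start + 1)
    return tags


def unique_fraction(seq: str, tag_len: int) -> float:
    """Fraction of anchored tag positions whose sequence occurs only once."""
    counts = Counter(anchored_tags(seq, tag_len))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / total


# Toy sequence: two anchor sites that differ only beyond the sixth base.
toy = "CATGAATTCATGAAGG"
short = unique_fraction(toy, 6)  # both tags read CATGAA
long_ = unique_fraction(toy, 8)  # tags read CATGAATT and CATGAAGG
```

On a whole genome the same count would be run per chromosome and on both strands; the toy sequence only shows that lengthening a tag can separate otherwise identical tags.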

The  comprehensiveness  of  STAGE,  by  analogy  to  SAGE,  is 
limited  in  principle  only  by  the  extent  of  sequencing.  We  identified 
dozens  of  new  E2F4  targets  after  sequencing  a  few  thousand  STAGE 
tags,  but  we  believe  our  coverage  is  not  saturating  for  two  reasons. 
First,  we  observed  minimal  overlap  between  the  tags  generated 
from  the  two  independent  STAGE  pools  and  saw  no  significant 
overlap  between  their  predicted  targets,  even  though  we  verified 
targets  from  each  pool.  Thus,  our  sampling  of  tag  space,  although 
valid,  is  relatively  sparse.  Second,  a  substantial  fraction  of  the  tags  in 
all  the  combined  human  STAGE  pools  was  observed  only  once. 
These  observations  suggest  that  E2F4  STAGE  tags  generated  by 
further  sequencing  are  likely  to  be  unique  and  will  help  predict 
additional  target  genes.  One  way  to  estimate  the  false  negative  rate 
in  future  studies  would  be  to  compare  the  predictions  from  STAGE 
after  saturation  sequencing,  with  predictions  made  by  analyzing 
ChIP  on  complete  tiling  microarrays  for  a  given  chromosome9. 

STAGE has many advantages for the analysis of genome-wide
DNA-protein interactions, especially in large genomes. First, it does
not make assumptions about the location of protein binding sites
on the genome. 98% of the human genome is within 1 kb of an
NlaIII site, so binding sites anywhere can potentially be sampled by
STAGE. Second, it does not require expensive infrastructure. We
estimate  that  sequencing  30,000  tags,  which  should  allow  for 
extensive  coverage  of  the  targets  of  a  single  protein,  will  entail 
sequencing  about  1,200  clones,  a  cost-efficient  option.  Third, 
STAGE  is  readily  applicable  to  any  sequenced  organism.  Finally, 
STAGE  is  not  restricted  to  a  specific  annotation  of  a  genome;  as  new 
transcriptional  units  are  discovered  and  existing  ones  become 
defunct23,24,  the  same  STAGE  tag  data  can  be  reanalyzed  to  identify 
targets  based  on  revised  genome  annotations. 

We  envision  STAGE  as  a  useful  complement  to  ChIP-chip  for 
analyzing  the  binding  distribution  of  proteins  on  the  genome. 
Although  STAGE  is  a  high-throughput  genomic  method,  it  is  less 
suited  than  ChIP-chip  for  repeated  quantitative  measurements  of 
the  binding  of  a  protein  under  a  range  of  physiological  conditions. 
However, the binding loci predicted by STAGE can be represented
on focused microarrays for ChIP-chip. Thus, an initial comprehensive
survey of direct binding targets by STAGE, followed by
extensive ChIP-chip analysis, can accelerate the discovery of
protein-binding regulatory elements in genomes.

METHODS 

Cells and antibodies. Yeast cells with a 3× hemagglutinin (HA)-tagged
TBP25 were grown at 25 °C in synthetic complete medium
minus uracil, collected by centrifugation, resuspended in an
equal volume of prewarmed 39 °C medium and returned to
39 °C. After 10 min, cells were cross-linked by adding formaldehyde
(final 1%). Anti-HA antibody (Santa Cruz) at a 1:100 dilution
was used for ChIP.

Human foreskin fibroblasts (ATCC CRL 2091) were grown to
60% confluence in 15 cm plates in DMEM containing glucose
(1 g/l), antibiotics, and 10% FBS (Hyclone). Cells were washed
twice with the same medium lacking FBS and low-serum medium
(0.1% FBS) was added. After 72 h, cells were cross-linked with
formaldehyde (final 1%). Anti-E2F4 antibody (sc-1082x, Santa
Cruz) at a 1:100 dilution was used for ChIP.

STAGE and SubSTAGE. Cross-linking, ChIP, and amplification of
ChIP DNA were performed as described previously26, except using
a 5'-biotinylated primer during amplification. Further details of
ChIP protocols are described in Supplementary Methods online.
We then followed the LongSAGE protocol (http://www.sagenet.org/),
except that we used amplified, biotinylated ChIP DNA as the
starting material. Briefly, amplified DNA (1-2 µg) was digested
with NlaIII. The terminal DNA fragments were bound to
streptavidin-coated magnetic beads (Dynal) and separated into two
tubes. After ligation with linker 1 or 2, which contain recognition
sites for MmeI, the DNA fragments were released by MmeI
digestion. The released tags were ligated to generate ditags. Ditags
were amplified with nested primers, gel purified and trimmed by
NlaIII digestion. Trimmed ditags were gel purified, concatemerized
by ligation and cloned into the pZero 1.0 vector (Invitrogen).
Insert sizes were assayed in recombinant clones and clones
containing at least ten ditags were sequenced. Details of the
subtraction step are provided in Supplementary Methods online. For the
mock immunoprecipitation control and reference samples (Figs. 3
and 4, respectively), the antibody was omitted. For the genomic
control STAGE pool, sheared normal human genomic DNA was
used as input into STAGE.

Data analysis and scoring. STAGE yields a list of tags with their
number of occurrences in the pool. This number is termed nocc.
Each valid STAGE tag has anywhere between one and several
thousand matches on the human genome. This number is termed
nhit. Our algorithm for defining target genes was as follows.
(1) Map the tags to the human genome. (2) Assign a score to
each tag based on nocc and nhit. (3) For each human gene, identify
tags  within  a  user-defined  window.  (4)  Calculate  a  cumulative 
score  for  the  gene  based  on  the  scores  of  all  tags  in  the  given 
window.  (5)  Compare  these  scores  to  the  experimental  and 
computational  control.  (6)  Genes  that  show  a  substantially  higher 
score  than  the  control  are  putative  targets.  Further  details  of  the 
scoring  algorithm  are  provided  in  Supplementary  Methods 
online.  For  all  analyses,  we  used  the  July  2003  build  of  the  Human 
Genome  sequence  assembly  available  at  http://genome.ucsc.edu. 
Genes  used  in  our  analysis  were  based  on  the  RefSeq  Genes 
annotation  at  the  University  of  California,  Santa  Cruz19. 
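
The steps above can be sketched in Python as follows. This is an illustrative sketch only. The exact scoring formula is given in the Supplementary Methods and is not reproduced here, so the tag score nocc/nhit, the pseudocount in the enrichment ratio, and all class and function names are assumptions chosen to match the stated behavior (more genomic hits lower the score, more occurrences raise it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    seq: str
    nocc: int                           # occurrences in the STAGE pool
    hits: tuple[tuple[str, int], ...]   # genomic matches (chrom, position)


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    tss: int      # transcription start site
    strand: str   # "+" or "-"


def tag_score(tag: Tag) -> float:
    """Assumed form: rises with nocc, falls with nhit = len(tag.hits)."""
    return tag.nocc / len(tag.hits)


def in_window(gene: Gene, chrom: str, pos: int,
              upstream: int, downstream: int) -> bool:
    """True if pos lies within the window around the gene's start site."""
    if chrom != gene.chrom:
        return False
    # Negative offsets are upstream of the start site on either strand.
    offset = pos - gene.tss if gene.strand == "+" else gene.tss - pos
    return -upstream <= offset <= downstream


def gene_raw_score(gene: Gene, tags: list[Tag],
                   upstream: int = 3000, downstream: int = 0) -> float:
    """Sum the scores of all tags with at least one hit in the window."""
    return sum(
        tag_score(tag)
        for tag in tags
        if any(in_window(gene, c, p, upstream, downstream)
               for c, p in tag.hits)
    )


def enrichment(chip_score: float, control_score: float,
               pseudocount: float = 1.0) -> float:
    """ChIP score over control score; the pseudocount (an assumption)
    avoids division by zero for genes with no control tags."""
    return chip_score / (control_score + pseudocount)


gene = Gene("EXAMPLE", "chr1", 10_000, "+")
tags = [
    Tag("CATG" + "A" * 17, nocc=2, hits=(("chr1", 9_500),)),
    Tag("CATG" + "C" * 17, nocc=1, hits=(("chr1", 9_000), ("chr2", 5))),
    Tag("CATG" + "G" * 17, nocc=3, hits=(("chr1", 20_000),)),
]
raw = gene_raw_score(gene, tags)   # first two tags fall in the 3-kb window
final = enrichment(raw, control_score=0.25)
```

The window is a parameter, so the same tag list can be rescored for the upstream and first-intron windows described in the text or against a revised gene annotation.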

Controls  and  P  values  for  STAGE  enrichment  scores.  For  an 
experimental  control,  we  performed  STAGE  on  input  genomic 
DNA  without  ChIP  and  calculated  background  gene  scores  for  all 
genes  in  the  same  manner  as  described  above  for  STAGE  from  an 
actual  ChIP.  Raw  gene  scores  derived  from  ChIP  STAGE  were 
divided  by  control  scores  to  obtain  the  final  STAGE  enrichment 
score. To calculate a P value for the STAGE enrichment score,
2,000 tags were computationally selected at random from the
redundant pool of all CATG 21-mers in the genome and used to
generate scores for each gene as described above. This process was
iterated 500 times to obtain a distribution of 500 scores for each
gene. For each gene, these scores were fitted to a normal distribution.
The experimentally determined STAGE enrichment score for
a particular gene was compared to this distribution and a P value
for the score was obtained. Experimental scores with P values less
than 0.01 were taken to be significant.
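
A minimal Python sketch of this resampling procedure is shown below for one gene. It is a sketch under stated assumptions and not the original implementation: the function names are hypothetical, tags are drawn with replacement, the scoring function is supplied by the caller, and the P value is taken as the one-sided upper tail of the fitted normal distribution.

```python
import random
from statistics import NormalDist, mean, stdev


def null_scores(score_fn, tag_pool, n_tags=2000, n_iter=500, seed=0):
    """Score n_iter random draws of n_tags tags from the genomic tag pool.

    score_fn maps a list of tags to the score of one gene. Sampling with
    replacement is an assumption made for simplicity.
    """
    rng = random.Random(seed)
    return [score_fn(rng.choices(tag_pool, k=n_tags)) for _ in range(n_iter)]


def p_value(observed, scores):
    """Upper-tail P value of observed under a normal fit to scores."""
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:  # degenerate null distribution
        return 0.0 if observed > mu else 1.0
    return 1.0 - NormalDist(mu, sigma).cdf(observed)


def is_significant(observed, scores, alpha=0.01):
    """Apply the 0.01 cutoff stated in the text."""
    return p_value(observed, scores) < alpha
```

In use, `tag_pool` would be the redundant list of all CATG 21-mers in the genome and `score_fn` the per-gene scoring function, so each gene receives its own null distribution of 500 scores.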

Microarrays.  Yeast  microarrays  including  all  ORFs  and  intergenic 
elements  were  manufactured  as  described  previously5,26.  PCR 
amplification,  fluorescent  labeling  of  ChIP  DNA  fragments  and 
hybridization  were  performed  as  described  previously26.  The 
reference hybridization probe was generated from sonicated normal
yeast genomic DNA processed identically to the probe for
ChIP DNA samples. A GenePix 4000B scanner and GenePix Pro
4.0 software (Axon Instruments) were used for scanning and
quantitation. Data were uploaded to a local database for analysis27.
The enrichment value of TBP ChIP was calculated by ranking
genomic loci according to their red/green fluorescence ratios. We
determined  the  percentile  rank  (0-100)  for  each  array  element  and 
either  used  it  directly  as  a  measure  of  binding  (Fig.  2b)  or  used  the 
average  percentile  rank  for  each  element  from  two  replicate 
hybridizations  (Fig.  2a).  When  multiple  microarray  elements 
could  potentially  represent  the  promoter  of  a  gene,  we  averaged 
their  percentile  ranks. 
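
The percentile-rank calculation can be sketched in Python as follows. This is an illustration and not the code behind the local database analysis. The function names are hypothetical, tied ratios are ranked in arbitrary order, and the lowest and highest ratios are assumed to map to 0 and 100.

```python
def percentile_ranks(ratios: dict[str, float]) -> dict[str, float]:
    """Map each array element to a 0-100 percentile rank of its ratio."""
    ordered = sorted(ratios, key=ratios.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 100.0}
    return {name: 100.0 * i / (n - 1) for i, name in enumerate(ordered)}


def average_ranks(*replicates: dict[str, float]) -> dict[str, float]:
    """Average the ranks of elements present in every replicate."""
    shared = set.intersection(*(set(r) for r in replicates))
    return {k: sum(r[k] for r in replicates) / len(replicates)
            for k in shared}


def promoter_enrichment(ranks: dict[str, float],
                        promoter_elements: dict[str, list[str]]
                        ) -> dict[str, float]:
    """Average the ranks of all elements that represent one promoter."""
    return {gene: sum(ranks[e] for e in elems) / len(elems)
            for gene, elems in promoter_elements.items()}


# Toy red/green ratios for three array elements in two hybridizations.
rep1 = percentile_ranks({"a": 0.5, "b": 2.0, "c": 1.0})
rep2 = percentile_ranks({"a": 1.0, "b": 3.0, "c": 0.2})
avg = average_ranks(rep1, rep2)
by_gene = promoter_enrichment(avg, {"GENE1": ["a", "b"]})
```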

PCR  primer  pairs  for  human  core  promoters20  were  purchased 
from  the  Whitehead  Institute  (Cambridge,  Massachusetts,  USA). 
Promoters  were  amplified  by  PCR  as  recommended  by  the 
manufacturer,  and  microarrays  were  manufactured  as  previously 
described26. PCR products corresponding to 33 additional promoter
and control loci, including the genes listed in Supplementary
Table 6 online, were included on the array. E2F4 ChIP DNA
fragments and the mock IP reference samples were amplified and
labeled by ligation-mediated PCR, using Cy5 and Cy3, respectively6.
The two fluorescently labeled samples were simultaneously
hybridized to the promoter microarray and ChIP enrichment of
target loci was calculated by ranking the Cy5/Cy3 (red/green)
fluorescence ratios.


PCR and primers. Thirty cycles of PCR were performed for the
samples in Figure 4 in a 25-µl reaction volume with 1 µl (4%) of
immunoprecipitated material. Primers were designed to assay
approximately between -400 bp and +1 of the transcription start site.
The ninth exon of CCNB1 and the core promoter of ACTB were
negative controls NC1 and NC2, respectively (Figs. 3a-c and 4a).
Primer sequences are provided in Supplementary Table 6 online.

Accession  numbers.  Microarray  data  have  been  deposited  in 
NCBI’s Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/)
and are accessible through GEO Series accession
number  GSE1861. 

Note:  Supplementary  information  is  available  on  the  Nature  Methods  website. 